Pretenuring for Java
Pretenuring is a technique for reducing copying costs in garbage collectors. When pretenuring, the allocator places long-lived objects into regions that the garbage collector will rarely, if ever, collect. We extend previous work on profile-driven pretenuring as follows. (1) We develop a collector-neutral approach to obtaining object lifetime profile information. We show that our collection of Java programs exhibits a very high degree of homogeneity of object lifetimes at each allocation site. This result is robust with respect to different inputs and is similar to previous work on ML, but contrasts with C programs, which require dynamic call-chain context information to extract homogeneous lifetimes. Call-site homogeneity considerably simplifies the implementation of pretenuring and makes it more efficient. (2) Our pretenuring advice is neutral with respect to the collector algorithm, and we use it to improve two quite different garbage collectors: a traditional generational collector and an older-first collector. The system is also novel in that it classifies and allocates objects into three categories: we allocate immortal objects into a permanent region that the collector will never consider, long-lived objects into a region in which the collector places survivors of the most recent collection, and short-lived objects into the nursery, i.e., the default region. (3) We evaluate pretenuring on Java programs. Our simulation results show that pretenuring significantly reduces collector copying for both generational and older-first collectors.
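The per-site classification described above can be sketched as follows. This is an illustrative, language-agnostic sketch in Python, not the paper's implementation; the thresholds, the lifetime metric (bytes allocated before death), and the function names are invented for illustration.

```python
# Hypothetical sketch of profile-driven pretenuring: classify each
# allocation site by the observed lifetimes of the objects it allocates.
# Because lifetimes at a Java allocation site are highly homogeneous,
# a simple per-site fraction suffices. Thresholds are illustrative.

def classify_site(lifetimes, heap_size, immortal_frac=0.9, long_frac=0.5):
    """Return 'immortal', 'long', or 'short' for one allocation site.

    lifetimes: bytes allocated after each object's birth before it died
    heap_size: scale for what counts as 'long-lived'
    """
    if not lifetimes:
        return "short"
    long_lived = sum(1 for t in lifetimes if t > heap_size) / len(lifetimes)
    if long_lived >= immortal_frac:
        return "immortal"   # permanent region the collector never examines
    if long_lived >= long_frac:
        return "long"       # region holding recent collection survivors
    return "short"          # nursery, i.e., the default region

# Homogeneous sites (the common case the paper observes) classify cleanly:
print(classify_site([10, 20, 15], heap_size=1000))        # short
print(classify_site([5000, 7000, 9000], heap_size=1000))  # immortal
```

The allocator would then consult this three-way advice at each site, with no dynamic call-chain context needed.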
Data reorganization for improving cache performance of object-oriented programs
Hardware trends have increased the disparity between processor and main memory performance. Processors are becoming faster and faster while main memory performance has not kept pace, so fetching data from main memory takes many cycles. Data caches, which are much faster and much smaller, bridge the performance gap to some extent by storing the most frequently accessed data items. For the best performance, it has become crucial to keep almost all the needed data in the cache rather than accessing the much slower main memory. Exploring both hardware and software techniques to increase the cache hit rate is a very active research area. Software techniques have been quite successful for regular scientific programs, where data accesses are fairly predictable (linear loop nests over arrays). However, for object-oriented programs, where data accesses are less predictable, there has been little success so far. In this work we explore run-time software techniques to improve the cache performance of object-oriented programs, in particular Java. We propose a range of data layout schemes using run-time profiling and feedback. We demonstrate how the Java class loader and the garbage collector may cooperate to produce better run-time data organizations. In particular, we explore different data traversals by the garbage collector, with and without profile information. We demonstrate the effectiveness of our techniques with an implementation using the IBM Jalapeño optimizing compiler and virtual machine and the UMass garbage collection toolkit. We identify when, and to what extent, our techniques are likely to be useful in real-world object-oriented programs.
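One way a copying collector can exploit a run-time profile is to copy objects in the order a hot access trace first touches them, so frequently co-accessed objects end up adjacent in memory. The sketch below is a minimal illustration of that idea in Python, not the paper's scheme; the function name and the trace data are invented.

```python
# Illustrative profile-guided layout: place hot objects in first-touch
# order (as a copying GC traversal could), and append cold objects last.

def profile_guided_layout(trace, all_objects):
    """Return a layout order: hot objects in first-touch order, cold last."""
    order, seen = [], set()
    for obj in trace:                 # profiled access sequence
        if obj not in seen:
            seen.add(obj)
            order.append(obj)
    order.extend(o for o in all_objects if o not in seen)  # cold objects
    return order

trace = ["a", "c", "a", "d", "c"]     # hypothetical run-time profile
objs = ["a", "b", "c", "d", "e"]
print(profile_guided_layout(trace, objs))  # ['a', 'c', 'd', 'b', 'e']
```

With such an ordering, objects accessed close together in time are more likely to share cache lines, which is the effect the traversal-order experiments in the paper measure.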
Loop Fusion for Data Locality and Parallelism
Modern processors use a memory hierarchy of several levels. Achieving high performance mandates effective use of cache locality. Compiler transformations can relieve the programmer from hand-optimizing for specific machine architectures. Loop fusion is a reordering transformation that merges multiple loops into a single loop. It can increase data locality, thereby better exploiting the cache; it can also increase the granularity of parallel loops, decreasing barrier synchronization overhead and improving program performance. However, very large granularity loops are undesirable if they introduce register spills inside the loop. Previous approaches to the fusion problem have considered all these factors in isolation. In this work, we present a new model which considers data locality and parallelism together, subject to register pressure. We build a weighted directed acyclic graph, called the fusion graph, in which the nodes represent loops and the weights on edges…
A Parametrized Loop Fusion Algorithm for Improving Parallelism and Cache Locality
Loop fusion is a reordering transformation that merges multiple loops into a single loop. It can increase data locality and the granularity of parallel loops, thus improving program performance. Previous approaches to this problem have looked at these two benefits in isolation. In this work, we propose a new model which considers data locality, parallelism, and register pressure together. We build a weighted directed acyclic graph in which the nodes represent program loops along with their register pressure, and the edges represent the amount of locality and parallelism present. The direction of an edge represents an execution-order constraint. We then partition the graph into components such that the sum of the weights on the edges cut is minimized, subject to the constraints that the nodes in the same partition can be safely fused together and that the register pressure of the combined loop does not exceed the number of available registers. Previous work demonstrates that the general problem of finding optimal partitions is NP-hard. In restricted cases, we show that it is possible to arrive at the optimal solution. We give an algorithm for the restricted case and a heuristic for the general case. We demonstrate the effectiveness of fusion and our approach with experimental results.
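A minimal greedy heuristic over such a fusion graph can be sketched as follows. This is an assumption-laden illustration, not the paper's algorithm: it merges the heaviest-weight edges first, sums per-loop register pressures as an estimate of the fused loop's pressure, and delegates legality (including execution-order constraints) to a caller-supplied predicate.

```python
# Hedged sketch of a greedy fusion heuristic. Edge weight models the
# locality/parallelism benefit lost if its endpoints are NOT fused, so
# merging heavy edges first tries to minimize the total weight cut.
# Summing register pressures and the default can_fuse are simplifications.

def greedy_fuse(pressure, edges, num_regs, can_fuse=lambda a, b: True):
    """pressure: {loop: register pressure}; edges: [(weight, a, b)].
    Returns {loop: partition_id} after greedy merging."""
    part = {n: i for i, n in enumerate(pressure)}          # singleton parts
    part_pressure = {part[n]: p for n, p in pressure.items()}
    for w, a, b in sorted(edges, reverse=True):            # heaviest first
        pa, pb = part[a], part[b]
        if pa == pb or not can_fuse(a, b):
            continue
        if part_pressure[pa] + part_pressure[pb] <= num_regs:
            for n in part:                                 # merge pb into pa
                if part[n] == pb:
                    part[n] = pa
            part_pressure[pa] += part_pressure.pop(pb)
    return part

# Two cheap loops fuse; the register-hungry third stays separate:
p = greedy_fuse({"L1": 4, "L2": 5, "L3": 12},
                [(9, "L1", "L2"), (3, "L2", "L3")], num_regs=10)
print(p["L1"] == p["L2"], p["L2"] == p["L3"])  # True False
```

A real implementation would also check fusion legality via dependence analysis and honor the DAG's edge directions when ordering the fused loops.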
A Parallel Implementation of a Correspondence-Finder for Uncalibrated Stereo Image Pairs
We report on our experience with parallelizing a computer vision algorithm. The algorithm employs low-level image processing techniques, which are relatively easy to parallelize, and intermediate-level computer vision techniques, which lack the regularity and locality of image processing algorithms. The application is an excellent candidate for use as a benchmark. We implement two parallel versions of this algorithm, the second based on our experience with the first. We program the parallel implementation in the Single Program, Multiple Data (SPMD) model using the MPI message passing interface. We evaluate our implementation on a four-node IBM SP. Our results show excellent speedups for the image processing portion and good speedups for most of the application. However, part of the application is inherently sequential. Our second parallel implementation is not only more efficient than the first, but also has better speedups. In addition, we suggest changes to ..
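The speedup ceiling imposed by the inherently sequential portion follows Amdahl's law: with sequential fraction s, the speedup on p processors is bounded by 1/(s + (1-s)/p). The fractions below are made-up illustrations, not measured values from the paper.

```python
# Amdahl's law: the sequential fraction of a program caps its speedup
# no matter how well the parallel portion scales.

def amdahl_speedup(seq_frac, procs):
    """Ideal speedup on `procs` processors given a sequential fraction."""
    return 1.0 / (seq_frac + (1.0 - seq_frac) / procs)

# On a 4-node machine like the IBM SP used in the paper:
print(amdahl_speedup(0.0, 4))   # 4.0   (perfectly parallel)
print(amdahl_speedup(0.25, 4))  # ~2.29 (a quarter sequential)
```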
Pretenuring For Java
Pretenuring can reduce copying costs in garbage collectors by allocating long-lived objects into regions that the garbage collector will rarely, if ever, collect. We extend previous work on pretenuring as follows. (1) We produce pretenuring advice that is neutral with respect to the garbage collector algorithm and configuration. We thus can, and do, combine advice from different applications. We find that predictions using object lifetimes at each allocation site in Java programs are accurate, which simplifies the pretenuring implementation. (2) We gather and apply advice to applications and to the Jalapeño JVM, a compiler and run-time system for Java written in Java. Our results demonstrate that building combined advice from different application executions into Jalapeño improves performance regardless of the application Jalapeño is compiling and executing. This build-time advice thus gives user applications some of the benefits of pretenuring without any application profiling. No previous work pretenures in the run-time system. (3) We find that application-only advice also improves performance, but that the combination of build-time and application-specific advice is almost always noticeably better. (4) The same advice improves the performance of both generational and Older First collection, illustrating that it is collector-neutral.
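Because the advice is collector-neutral, profiles from different programs can be merged before classification. The sketch below illustrates one plausible way to do that; the allocation-weighted averaging, the threshold, and all names are assumptions for illustration, not the paper's formula.

```python
# Hypothetical sketch of combining pretenuring advice across programs:
# merge per-site profiles weighted by bytes allocated, then classify.

def combine_advice(profiles, long_threshold=0.5):
    """profiles: per-program dicts {site: (bytes_alloc, frac_long_lived)}.
    Returns {site: 'long' | 'short'} from allocation-weighted averages."""
    totals = {}  # site -> [total bytes, allocation-weighted fraction sum]
    for prog in profiles:
        for site, (alloc, frac) in prog.items():
            t = totals.setdefault(site, [0.0, 0.0])
            t[0] += alloc
            t[1] += alloc * frac
    return {s: ("long" if w / b >= long_threshold else "short")
            for s, (b, w) in totals.items()}

# Build-time (VM) profile merged with an application profile:
build = {"VM.alloc": (1000, 0.9)}
app = {"VM.alloc": (200, 0.8), "App.new": (500, 0.1)}
print(combine_advice([build, app]))
# {'VM.alloc': 'long', 'App.new': 'short'}
```

Sites never seen by a given application simply keep the build-time classification, which is how build-time advice can help an unprofiled application.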